LSimpute: accurate estimation of missing values in microarray data with least squares methods.

نویسندگان

  • Trond Hellem Bø
  • Bjarte Dysvik
  • Inge Jonassen
چکیده

Microarray experiments generate data sets with information on the expression levels of thousands of genes in a set of biological samples. Unfortunately, such experiments often produce multiple missing expression values, normally due to various experimental problems. As many algorithms for gene expression analysis require a complete data matrix as input, the missing values have to be estimated in order to analyze the available data. Alternatively, genes and arrays can be removed until no missing values remain. However, for genes or arrays with only a small number of missing values, it is desirable to impute those values. For the subsequent analysis to be as informative as possible, it is essential that the estimates for the missing gene expression values are accurate. A small amount of badly estimated missing values in the data might be enough for clustering methods, such as hierachical clustering or K-means clustering, to produce misleading results. Thus, accurate methods for missing value estimation are needed. We present novel methods for estimation of missing values in microarray data sets that are based on the least squares principle, and that utilize correlations between both genes and arrays. For this set of methods, we use the common reference name LSimpute. We compare the estimation accuracy of our methods with the widely used KNNimpute on three complete data matrices from public data sets by randomly knocking out data (labeling as missing). From these tests, we conclude that our LSimpute methods produce estimates that consistently are more accurate than those obtained using KNNimpute. Additionally, we examine a more classic approach to missing value estimation based on expectation maximization (EM). We refer to our EM implementations as EMimpute, and the estimate errors using the EMimpute methods are compared with those our novel methods produce. The results indicate that on average, the estimates from our best performing LSimpute method are at least as accurate as those from the best EMimpute algorithm.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Missing Value Estimation for DNA Microarray Expression Data: Least Squares Imputation

Motivation: Gene expression microarray data sets often contain missing expression values. Robust missing value estimation methods are needed since many algorithms for gene expression analysis require a complete matrix of gene array values. In this paper, imputation methods based on the least squares and cluster structure are proposed to estimate missing values in the gene expression data, which...

متن کامل

Missing Values Estimation in Microarray Data with Partial Least Squares Regression

Microarray data usually contain missing values, thus estimating these missing values is an important preprocessing step. This paper proposes an estimation method of missing values based on Partial Least Squares (PLS) regression. The method is feasible for microarray data, because of the characteristics of PLS regression. We compared our method with three methods, including ROWaverage, KNNimpute...

متن کامل

Missing value estimation for DNA microarray gene expression data: local least squares imputation

MOTIVATION Gene expression data often contain missing expression values. Effective missing value estimation methods are needed since many algorithms for gene expression data analysis require a complete matrix of gene array values. In this paper, imputation methods based on the least squares formulation are proposed to estimate missing values in the gene expression data, which exploit local simi...

متن کامل

BIOINFORMATICS Collateral Missing Value Imputation: A New Robust Missing Value Estimation Algorithm For Microarray Data

Motivation: Microarray data is used in a range of application areas in biology, though often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible prior to using these algorithms. While many imputation algo...

متن کامل

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

MOTIVATION Microarray data are used in a range of application areas in biology, although often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Nucleic acids research

دوره 32 3  شماره 

صفحات  -

تاریخ انتشار 2004